
feat: DeepSeek-V4-Pro perf recipes for GB300 / GB200 (1k/1k agg)#70

Merged
ishandhanani merged 4 commits into NVIDIA:main from elvischenv:dsv4-pro-recipes
Apr 24, 2026

Conversation

@elvischenv (Contributor)

Based on #69

YAMY1234 and others added 3 commits April 24, 2026 01:48
Adds eight SGLang recipes covering NVIDIA-verified DeepSeek-V4-Pro
(1.6T MXFP4 MoE) aggregated-serving configurations on Grace+Blackwell:

  recipes/gb300-fp4/1k1k-dsv4/
    agg-low-latency.yaml       — TP=4 + MTP 3/4   (min TPOT)
    agg-nomtp.yaml             — TP=4             (baseline)
    agg-balanced-tep.yaml      — TP=4+DP=4 DeepEP + MTP 1/2
    agg-max-tpt-tep.yaml       — TP=4+DP=4 DeepEP (max TPS/GPU)
    agg-2n-low-latency.yaml    — TP=8 + MTP 3/4
    agg-2n-nomtp.yaml          — TP=8

  recipes/gb200-fp4/1k1k-dsv4/
    agg-2n-low-latency.yaml    — TP=8 + MTP 3/4
    agg-2n-nomtp.yaml          — TP=8

Flag set derived from the SGLang DSv4 cookbook
(docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx):

  * moe-runner-backend: flashinfer_mxfp4      (MXFP4 MoE on Blackwell)
  * chunked-prefill-size: 4096 + disable-flashinfer-autotune: true
  * EAGLE spec-decoding 3/4 for low-latency, 1/2 for balanced
  * TEP recipes: enable-dp-attention + moe-a2a-backend: deepep +
    deepep-config num_sms=96 (DEEPEP_LARGE_SMS_FLAG, single-node Blackwell)
  * disable-radix-cache: true (synthetic bench best practice, also
    reduces allocator fragmentation during MXFP4 weight-reorder)
  * mem-fraction-static: 0.78 (0.82 intermittently OOMs GB300 during
    reorder_w1w3_to_w3w1; 0.78 leaves contiguous headroom)
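
As a rough illustration, a low-latency aggregated recipe combining these flags might look like the fragment below. The section layout and the speculative-decoding key names are assumptions for illustration; only the flag values themselves come from the list above, not from the merged YAML:

```yaml
# Hypothetical sketch of agg-low-latency.yaml; structure and spec-decoding
# key names are assumed, flag values are from the PR description.
model: deepseek-v4-pro
server_args:
  tp-size: 4
  moe-runner-backend: flashinfer_mxfp4   # MXFP4 MoE on Blackwell
  chunked-prefill-size: 4096
  disable-flashinfer-autotune: true
  speculative-algorithm: EAGLE           # MTP 3/4 for min TPOT
  disable-radix-cache: true              # synthetic-bench best practice
  mem-fraction-static: 0.78              # 0.82 intermittently OOMs GB300
```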

srtslurm.yaml.example: added deepseek-v4 model + container aliases.

Also adds README.md in each recipe subdir with rationale + reference
pointers to the SGLang cookbook and DSv4-Pro model card.

Made-with: Cursor
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@54badf2).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #70   +/-   ##
=======================================
  Coverage        ?   70.35%           
=======================================
  Files           ?       59           
  Lines           ?     6270           
  Branches        ?        0           
=======================================
  Hits            ?     4411           
  Misses          ?     1859           
  Partials        ?        0           

…ARMUP

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
YAMY1234 added a commit to YAMY1234/srt-slurm-upstream that referenced this pull request Apr 24, 2026
Adapted from @ishandhanani's dynamo-frontend disagg recipe. 1 prefill
+ 1 decode node at TP=4, MXFP4 FlashInfer MoE runner, NIXL transfer
backend. Benchmark type is ``manual`` for external sweep control.

Lives under ``recipes/gb300-fp4/1k1k-dsv4/`` to match PR NVIDIA#70's
hardware-partitioned layout. Uses ``dsv4-pro`` checkpoint and
``dsv4-grace-blackwell`` container (GB300 aarch64 sglang image), since
DSV4-Flash weights are not staged and the original ``dsflash`` /
``dspro`` / ``nginx`` aliases from the upstream recipe are local to
@ishandhanani's environment.

Dynamo hash pinned: 9d3c913d300eb368cda28b3f98a23a5762621e0d

Made-with: Cursor
ishandhanani pushed a commit that referenced this pull request Apr 24, 2026
* feat(sa-bench): add sglang DeepSeek-V4 tokenizer

Adds a client-side tokenizer for DeepSeek-V4-Pro that matches sglang
server behavior. Usable via the existing 'module.path.ClassName' hook
in backend_request_func.get_tokenizer; no changes to sa-bench itself.

Motivation: DeepSeek-V4 ships no HF chat template, so
tokenizer.apply_chat_template() raises ValueError. Sglang's server
replaces the HF path with a hard-coded DSML encoder
(encoding_dsv4.encode_messages) whenever arch == 'DeepseekV4ForCausalLM',
per sgl-project/sglang PR #23600. Without a matching client-side
encoder, sa-bench input_tokens diverges from server #new-token.

Implementation:
  - sa_bench_tokenizers/_sglang_encoding_dsv4.py: vendored byte-exact
    from sgl-project/sglang@f5d03db (Apache-2.0, 840 lines).
  - sa_bench_tokenizers/sglang_deepseek_v4.py: HF-compatible wrapper.
    apply_chat_template() mirrors serving_chat exactly:
      1) insert empty system message if missing
      2) thinking_mode='chat', reasoning_effort=None (defaults)
      3) call encode_messages(...)
      4) hf_tokenizer.encode(..., add_special_tokens=False)

Usage (recipe):
  benchmark:
    custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"

Recipes: recipe YAML authoring is out of scope; see #70
for DeepSeek-V4-Pro sglang recipes.

* fix(sa-bench): load DeepSeek-V4 via PreTrainedTokenizerFast first

AutoTokenizer.from_pretrained rejects checkpoints whose model_type
(deepseek_v4) is not yet registered in mainline transformers. The V4
checkpoint ships a ready-made tokenizer.json, so prefer loading it
directly through PreTrainedTokenizerFast and only fall back to
AutoTokenizer (for future transformers releases that register V4).
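
The loading order reduces to a try-fast-first pattern. The sketch below shows that pattern generically, with the loaders injected as callables so it runs without transformers installed; in the real code the two loaders would be `PreTrainedTokenizerFast.from_pretrained` and `AutoTokenizer.from_pretrained` (names per the commit message):

```python
# Generic "fast tokenizer first, AutoTokenizer as fallback" loading order.
def load_tokenizer(path, fast_loader, auto_loader):
    # Prefer the checkpoint's ready-made tokenizer.json: the fast path does
    # not consult model_type, so unregistered archs (deepseek_v4) still load.
    try:
        return fast_loader(path)
    except Exception:
        # Fall back for future transformers releases that register the arch.
        return auto_loader(path)

def failing_fast(path):
    raise ValueError("no tokenizer.json")  # simulate a broken fast path

# Fast path wins when it succeeds; fallback only on failure.
tok = load_tokenizer("ckpt", fast_loader=lambda p: f"fast:{p}",
                     auto_loader=lambda p: f"auto:{p}")
tok2 = load_tokenizer("ckpt", fast_loader=failing_fast,
                      auto_loader=lambda p: f"auto:{p}")
```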

Verified offline on DeepSeek-V4-Pro: 4 representative prompts
(hello, GSM8K, system+user, multi-turn) each produce client token IDs
byte-identical to the sglang server path
tokenizer.encode(encode_messages(msgs_with_empty_system,
thinking_mode=chat)).

* feat(recipes): add gb300 1-node agg smoke recipe for SGLang DeepSeek-V4 tokenizer

Minimal recipe that exercises the new SGLangDeepseekV4Tokenizer wrapper end
to end on a single GB300 node (TP=4, MTP 3/4, MXFP4 MoE). Used to verify
client-side prompt encoding aligns with the SGLang server for DeepSeek-V4.

Key knobs:
- mem-fraction-static: 0.82 (required on 1-node: DSv4-Pro weights occupy
  ~206 GB/GPU, so a lower mfs leaves negative KV pool after SGLang reserves
  (1-mfs)*total_mem for activations, triggering "Not enough memory")
- use_chat_template: true
- custom_tokenizer: sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer
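
The mem-fraction-static constraint is simple arithmetic: SGLang reserves `(1 - mfs) * total` for activations, weights come out of the static pool, and the remainder is the KV cache. The per-GPU HBM total below is an assumed figure for illustration only; the 206 GB weight footprint is from the commit message:

```python
# Back-of-envelope KV-pool headroom as a function of mem-fraction-static.
TOTAL_GB = 288.0     # hypothetical HBM per GPU (assumption, not from the PR)
WEIGHTS_GB = 206.0   # DSv4-Pro weights per GPU (from the commit message)

def kv_pool_gb(mfs, total=TOTAL_GB, weights=WEIGHTS_GB):
    # static pool (mfs * total) minus weights = KV cache headroom
    return mfs * total - weights

# Raising mfs from 0.78 to 0.82 grows the KV pool by 0.04 * total (~11.5 GB
# at the assumed total), which is why the 1-node recipe needs the higher value.
delta = kv_pool_gb(0.82) - kv_pool_gb(0.78)
```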

* chore: add NVIDIA SPDX copyright headers to sa_bench_tokenizers files

Required by NVIDIA/srt-slurm CI license check.

* feat(sa-bench): mirror SGLANG_ENABLE_THINKING / SGLANG_REASONING_EFFORT env fallback

The client-side DSV4 tokenizer renders prompts via the vendored
``encoding_dsv4.encode_messages`` and previously hard-coded
``thinking_mode="chat"`` and ``reasoning_effort=None``. This matches the
sglang server only when the user has not set the DSV4 thinking knobs.

sglang ``serving_chat.py`` (PR #23600) honors two envs as fallbacks:

- ``SGLANG_ENABLE_THINKING=1`` -> ``thinking_mode="thinking"``
- ``SGLANG_REASONING_EFFORT=max|high`` -> passed through to the encoder

Recipes typically set these in ``prefill_environment`` /
``decode_environment`` for reasoning eval workloads (gpqa, aime25, etc.)
on DeepSeek-V4-Flash. Without matching fallback on the sa-bench client,
server prompts would be wrapped in ``<think>...</think>`` while the
client still rendered the chat template, desynchronizing ISL / TPOT /
MTP accept-rate accounting.

This change:

- Adds ``_env_enable_thinking()`` / ``_env_reasoning_effort()`` helpers
  that parse the envs the same way sglang ``EnvBool`` / ``EnvStr`` do
  (``{1,true,yes,on}`` truthy; ``max|high`` filter).
- Changes ``apply_chat_template`` defaults from Python literals to
  ``None`` sentinels; when ``None``, falls back to env. Explicit caller
  kwargs (incl. ``thinking=False``) still win, matching the server's
  ``(request.chat_template_kwargs or {}).get("thinking", env_default)``
  precedence.
- Expands the docstring to show the real server call chain (not just
  the happy-path defaults).

Smoke-tested all five precedence cases (no env + no kwarg / env on +
no kwarg / env on + explicit False / env off + explicit True / bogus
env value filtered).

Made-with: Cursor
@ishandhanani ishandhanani merged commit 2cbbede into NVIDIA:main Apr 24, 2026
6 checks passed
@elvischenv elvischenv deleted the dsv4-pro-recipes branch April 27, 2026 14:13